Pesquisa | Portal Regional da BVS

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters.

Chu, Justin; Mohamadi, Hamid; Erhan, Emre; Tse, Jeffery; Chiu, Readman; Yeo, Sarah; Birol, Inanc.

Proc Natl Acad Sci U S A ; 117(29): 16961-16968, 2020 07 21.

Artigo em Inglês | MEDLINE | ID: mdl-32641514

RESUMO

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.

Assuntos

Análise de Sequência de DNA/métodos , Software , Animais , Pareamento Incorreto de Bases , Humanos , Filogenia , Alinhamento de Sequência , Análise de Sequência de DNA/normas

Alignment-free clustering of UMI tagged DNA molecules.

Orabi, Baraa; Erhan, Emre; McConeghy, Brian; Volik, Stanislav V; Le Bihan, Stephane; Bell, Robert; Collins, Colin C; Chauve, Cedric; Hach, Faraz.

Bioinformatics ; 35(11): 1829-1836, 2019 06 01.

Artigo em Inglês | MEDLINE | ID: mdl-30351359

RESUMO

MOTIVATION: Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient. RESULTS: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far less resources than alignment-based methods, and its clusters reduce the number of tentative false positive in downstream variation calling. AVAILABILITY AND IMPLEMENTATION: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Software , Algoritmos , Análise por Conglomerados , DNA , Análise de Sequência de DNA

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA